OFFICE of INSPECTOR GENERAL

Date:        May 4, 2011
Reply to
Attn of:     Office of Inspector General (OIG)

Subject:     Management Letter No. 11-12, Limitations on the ability to ingest, search and access
             records in the Electronic Records Archives

To:          David S. Ferriero, Archivist of the United States (N)

The National Archives and Records Administration (NARA) is in the final developmental phase
of the Electronic Records Archives Program (ERA). Throughout the six years since Lockheed
Martin Corporation (LMC) was awarded the contract to build ERA, the Office of Inspector
General (OIG) has asked fundamental questions of ERA program managers, employees,
contractors and senior NARA officials. The most basic being, "At full operational capability
(FOC), will the common citizen be able to effectively access and research the electronic records
they are entitled access to over the internet?" We believe the answer, with limited caveats, is no.
Limitations on search capabilities combined with constraints on secure data ingestion will result
in a scaled back FOC failing to meet the most basic requirement of providing timely, effective
access to public records in NARA's holdings in a searchable manner over the internet.

The ERA, as a whole, is comprised of separate instances, or systems, which can be tailored to
certain needs. Thus, there is an Executive Office of the President (EOP) instance dedicated to
Presidential records, a Congressional instance, a Census instance, etc. However, the Base ERA
instance is the main system where the vast majority of federal agencies' records will be stored.
The Online Public Access (OPA) program will serve as the public's interface to research Base
ERA records.

As explained by NARA officials, the records in Base ERA will not be content searchable.
Only those records which NARA decides to copy from Base ERA and put into a new, as yet
undeveloped, intermediary system will actually have their contents searchable by OPA's
program. Obviously, some records will have to be withheld for security, privacy, and other
legitimate issues. However, as explained by NARA officials, not all records the public has the
right to access will be copied to the OPA intermediary to be searchable. These limitations on the
content-based search capability of ERA were discussed previously in Management Letter 11-08,
Electronic Records Archives Lacks Ability to Search Records' Contents, dated January 5, 2011.
As serious as these limitations are, they are not the most pressing concern at this time. As
currently planned, the intermediary for OPA will not even be developed at FOC. Thus, there
will not be a method for OPA to connect with and search Base ERA for any "new" records
ingested. OPA's access will be limited to records described or searchable in NARA's currently
available legacy systems, such as the Access to Archival Databases (AAD), which were in use
prior to ERA's development. NARA reports that new records ingested into Base ERA may be
manually reviewed, copied and put into one of NARA's legacy systems to make them available
to OPA searches. However, such a manually intensive process is likely to be overwhelmed by
the vast troves of electronic records warranting public access which are slated to flow into the
Base ERA from federal agencies. The cumulative effect is likely to be that significant quantities
of records warranting public access will not be accessible by researchers over the Internet. For
example, presently Base ERA holds approximately 16,777,216 megabytes of records, and only
23 files comprising approximately 125 megabytes are searchable by OPA.
These files come from only one series of records, "County Business Patterns," covering 1970
to 1973. We understand ERA has not reached FOC and this example has limitations, but we
believe it is indicative of the issues arising from the manual process of how the ERA search
function gains access to records.

To this assessment, we add two new concerns pertaining to the ingestion process for any record
to enter into Base ERA in the first place. First, NARA has implemented a process for screening
for classified records that appears likely not only to fail to effectively screen records for national
security classified information, but also to add such burden it will immensely delay the speed by
which records are ingested. Second, the OIG was originally told this program would be used to
automate a process for screening records for privacy related and personally identifiable
information (PII). We were subsequently informed NARA is not planning on developing any
automated system to assist in screening records for PII before they are made available to the
public. No finalized program or policy for screening ERA records for PII or other privacy
related information has been conveyed to the OIG. When asked, NARA officials have indicated
an archivist may be required to personally view and screen each of the impossibly immense
number of files the ERA will receive.

As envisioned, originating agencies would transfer their electronic records to NARA based on
their NARA-approved records schedule. Base ERA is not a national security classified system,
so in theory, no agency will send any classified records to Base ERA. However, in reality,
NARA must plan for the fact classified records may accidentally or mistakenly be transferred to
Base ERA. This is referred to as "spillage." Thus, NARA officials have decided to scan records
for classified content using a freeware tool identified as Lucene, before the records are actually
ingested into Base ERA.
Lucene works by searching for certain words and phrases provided by
NARA. Any file containing these words or phrases in certain amounts would have to be taken
out of the transfer, quarantined, and returned to the originating agency. This pre-ingest screening
is all the more important as there is currently no way to search the full text content of all records
in Base ERA.

There are several issues with the scanning process. The Lucene program requires an adjustment
or add-on for each type of file it needs to search (e.g., WordPerfect, Word, PDF, Lotus Notes).
For Lucene to be effective as a systemic solution, NARA must identify every type of
program used in the federal government and continually update Lucene as new program types
are used. However, NARA does not currently have a list of all types of programs used by the
government (and legacy programs as agencies send older files to NARA), and they do not appear
to be planning to do this for the past or future. Furthermore, Lucene has no optical character
recognition capacity and cannot search image-based files like scanned jpeg[1] files, photos or
similar items. Additionally, Lucene does not search file names, even for those types of files
where it cannot search their content. For instance, a file labeled "Top Secret - Nuclear Weapon
Design - Top Secret" and containing properly marked, scanned jpeg copies of missile designs
would not be identified. We believe the totality of these issues poses a significant risk of
allowing classified files to slip past this system.

For those files Lucene does search, it looks for terms and use tendencies. At the very start, the
production of such a list of terms or phrases to look for would be problematic.
Many relevant terms to search for would themselves be classified and would have to be
continuously updated.
Term-based searching is likely to generate large volumes of "false positives" based upon the
defined parameters of the search, as identified by this office during the investigation of the
missing 2-terabyte Clinton White House hard drive. Even if the false positives comprise a very
small percentage of the transferred files, the ERA is supposed to be receiving such vast quantities
of information that the number of false positives could become overwhelming. For any file
"flagged" by Lucene, NARA will presumptively treat it as classified and return it to the sending
agency for a determination of whether or not the file is releasable. This is likely to lead to large
delays as high numbers of files are sent back to the agencies under the strict controls of classified
information for review. Since files are not generally transferred to NARA contemporaneously
with their original creation, it is likely that the file creators may no longer be at the agency, or the
particular program may even be expired. At present, there is no simplified procedure to return
the files and get them cleared or inspected in a timely fashion. If one imagines ERA as a busy
six-lane highway moving an immense amount of traffic, this part of the ingest procedure is akin
to closing five lanes for a stretch. While the rest of the highway remains capable of transporting
all the traffic, the back-up or bottleneck caused by that one stretch makes it impractical to use the
road.

Finally, neither Lucene nor any other technology-based solution is being used to attempt to
screen records for PII or other privacy data. For example, Lucene is capable of searching for
number patterns indicative of Social Security numbers, but NARA has not configured our system
to do so. According to LMC officials there has been no direction from NARA about what to do
with PII in ERA.
Again, no finalized program or policy for screening ERA records for PII or
other privacy-related information has been conveyed to the OIG. When asked, NARA officials
have indicated it may require an archivist to personally view and screen each of the impossibly
immense number of files the ERA will receive. The replication of such antiquated paper-based
processes in ERA yields only one outcome: a system so hampered and slowed by manual inputs
it will be swamped beyond its means by the sheer numbers of electronic records NARA should
be preserving for the nation.

[1] A common file type used for digital images and photos.

This letter is not intended to simply convey the deficiencies in the technology employed during
ingest screening. At its core, this is also a policy issue. Lucene was selected based on the
requirements given by the ERA program. Senior ERA officials reported it took more than two
years to develop these requirements, and yet the only requirement agreed upon was that the tool
should screen for given keywords. This approach was defective from the start for the
bottlenecking reason stated above, and the OIG has not received any comprehensive policy
determination on how to handle screening for PII. We realize this screening issue is a hard
problem and that presently there may be no tool which can resolve the issue on its own. Thus the
focus should not be exclusively on the functions of the tool used in this process. What is also
needed is a concerted effort to formulate a set of policies that untangle these knots and refine a
set of rules capable of being implemented. For example, a rule might shift more requirements to
federal agencies for scanning and certifying their records are free of sensitive materials before
delivering them to NARA.
If policy cannot be articulated in a clear and concise way, then there is
no tool that can implement it.

We are concerned that pertinent stakeholders are not aware of the currently planned search
limitations of ERA at FOC. Further, we do not believe potential spillover and PII issues have
been adequately addressed in a manner providing for an efficiently working system capable of
handling the amount of records expected to come to ERA.

If you have any questions concerning the information presented in this Management Letter,
please contact me at (301) 837-1532.


Paul Brachfeld
Inspector General

NARA's web site is http://www.nara.gov